This document presents an Exploratory Data Analysis performed on the data collected for one of the models in the proposed pipeline, the Sentiment Analysis Model.
This process ensures that the data used to train the model is of high quality and free from errors or inconsistencies.
In general, the larger the corpus, the better an NLP model tends to perform.
Data preparation involves selecting and engineering the appropriate features, such as n-grams, word embeddings, and syntactic features, to improve the performance of the NLP model.
Data preparation involves splitting the data into training, validation, and testing sets to ensure that the model is trained on a representative sample of the data.
In summary, data preparation is a critical step in NLP as it ensures that the model is trained on high-quality, representative data and that the features are carefully engineered to capture the nuances of natural language.
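The train/validation/test split mentioned above can be sketched with scikit-learn's `train_test_split` (which this notebook imports). This is only an illustrative example on a hypothetical toy DataFrame; the actual split for this pipeline happens in a later modelling phase. The `stratify` argument keeps class proportions equal across the splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real corpus.
df = pd.DataFrame({
    'text': ['great day', 'so angry', 'happy now', 'awful ref',
             'love this', 'bad call', 'nice game', 'boiling mad'],
    'emotion': ['happiness', 'anger', 'happiness', 'anger',
                'happiness', 'anger', 'happiness', 'anger'],
})

# Carve off a held-out test set first, then split the remainder into
# training and validation sets; stratify preserves the class balance.
train_val, test = train_test_split(
    df, test_size=0.25, stratify=df['emotion'], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val['emotion'], random_state=42)

print(len(train), len(val), len(test))  # 4 2 2
```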
The datasets collected during data sourcing have already been initially prepared during the EDA, which is described in a previous file.
In this notebook we load the pre-prepared data and continue to pre-process it for modelling in a later phase.
Loading libraries & Data
Encoding classes
Plotting class distribution
Balancing classes
Categorizing classes
Pre-Processing text features
Saving data file
Conclusion
# NumPy is a library for numerical computing in Python, providing a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays.
import numpy as np
# Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data manipulation library for Python. It provides data structures for efficiently storing and manipulating large and complex data sets.
import pandas as pd
# Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
import matplotlib.pyplot as plt
from matplotlib import rc
# Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms.
from sklearn.model_selection import train_test_split
# Plotly Express is a high-level data visualization library for Python.
import plotly.express as px
# Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data.
# It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
# along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, semantic reasoning, and wrappers for industrial-strength NLP libraries.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
# Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
from bs4 import BeautifulSoup
# String is a Python library that contains a set of string constants, including the punctuation characters.
import string
# Make plot interactive in HTML format
import plotly.io as pio
pio.renderers.default='notebook'
# Importing data from EDA output (initially prepared for EDA)
data = pd.read_csv('SA_model_data_after_EDA')
# Dropping dummy column
data = data.drop('Unnamed: 0', axis=1)
# Printing loaded df
data
| | emotion | text | needed_model | num_words | text_length |
|---|---|---|---|---|---|
| 0 | anger | point today someone says something remotely ki... | yes | 10 | 69 |
| 1 | anger | game day minus 14 30 relentless | yes | 6 | 31 |
| 2 | anger | game pissed game year blood boiling time turn ... | yes | 10 | 58 |
| 3 | anger | found candice candace pout likes | yes | 5 | 32 |
| 4 | anger | cannot come muma 60th 25k tweets soreloser | yes | 7 | 42 |
| ... | ... | ... | ... | ... | ... |
| 66009 | happiness | succesfully following tayla | yes | 3 | 27 |
| 66010 | love | happy mothers day love | yes | 4 | 22 |
| 66011 | love | happy mothers day mommies woman man long momma... | yes | 10 | 58 |
| 66012 | happiness | wassup beautiful follow me peep new hit single... | yes | 13 | 73 |
| 66013 | love | bullet train tokyo gf visiting japan since thu... | yes | 12 | 88 |
66014 rows × 5 columns
# Renaming neutral class to rq - rhetorical question
data.loc[data["emotion"] == "neutral", "emotion"] = "rq"
# Changing classifier for if "needed_model"
data.loc[data["emotion"] == "rq", "needed_model"] = "yes"
# Renaming hate class to anger
data.loc[data["emotion"] == "hate", "emotion"] = "anger"
# Changing classifier for if "needed_model"
data.loc[data["emotion"] == "anger", "needed_model"] = "yes"
# Creating a subset of dataframe for plotting
df = data.groupby(['emotion', 'needed_model'])['emotion'].count().reset_index(name="count").sort_values('count', ascending=False)
# Calculating the mean sample count per class
avg = data["emotion"].value_counts().values.mean()
# Initiating the bar chart
fig = px.bar(df, # Dataframe
y='count', # y value
x='emotion', # x value
text_auto='.2s', # text for labels
color= 'needed_model', # plot color palette
title="Class distribution in the dataframe") # title for the chart
# Adding a horizontal "target" line
fig.add_shape(type="line", # selecting type
line_color="salmon", # selecting color
line_width=3, # selecting width
opacity=1, # selecting opacity
line_dash="dot", # dash of line
x0=0, # selecting position, start and end point, values at start and end
x1=1,
xref="paper",
y0=avg,
y1=avg,
yref="y",
)
# Adding plot legend and renaming x,y labels
fig.update_layout(legend_title_text='Is the class needed for model?', # legend label
title="Class distribution in the dataframe", # chart title
xaxis_title="Class label", # x label
yaxis_title="Class count", # ylabel
font=dict(family="Courier New, monospace", # font settings
size=18,
color="black")
)
# Adding annotation to the mean line
fig.add_annotation(xref="paper", # selecting style
x=0.98, # selecting x and y positions
y=5500,
text="Average count per class: "+str(round(avg,1)), # pasting text with the avg value
showarrow=False # disabling arrow
)
# Updating layout style
fig.update_layout(barmode="group",
clickmode="event+select",
xaxis_tickangle=-45)
# Selecting settings for legend
fig.update_layout(height=1000, # changing height of the whole vis
legend=dict(x=0.85, # selecting legend position
y=1.3,
traceorder="reversed",
title_font_family="Courier New, monospace", # font settings
font=dict(family="Courier",
size=12,
color="black"),
bgcolor="white", # color settings
bordercolor="navy",
borderwidth=4)
)
# Finally, showing the figure
fig.show()
# Filter the data to keep only the rows where 'needed_model' is 'yes', and only the 'emotion' and 'text' columns.
data = data[data['needed_model']=='yes'][['emotion', 'text']]
# Find the unique classes (emotions) in the dataset and count the number of classes.
classes=sorted(list(data['emotion'].unique()))
class_count = len(classes)
# Print the number of classes in the dataset.
print('The number of classes in the dataset is: ', class_count)
print('')
# Group the data by emotion to get the count of each class.
groups=data.groupby('emotion')
print('{0:^30s} {1:^13s}'.format('CLASS', 'VALUE COUNT'))
countlist=[]
classlist=[]
for label in sorted(list(data['emotion'].unique())):
    group=groups.get_group(label)
    countlist.append(len(group))
    classlist.append(label)
    print('{0:^30s} {1:^13s}'.format(label, str(len(group))))
# Get the classes with the minimum and maximum number of training samples.
max_value=np.max(countlist)
max_index=countlist.index(max_value)
max_class=classlist[max_index]
min_value=np.min(countlist)
min_index=countlist.index(min_value)
min_class=classlist[min_index]
# Print results
print('')
print(max_class, ' has the most examples= ', max_value, ' ', min_class, ' has the least examples= ', min_value)
The number of classes in the dataset is: 7
CLASS VALUE COUNT
anger 5792
fear 4600
happiness 13400
love 5314
rq 8271
sadness 12369
worry 8347
happiness has the most examples= 13400 fear has the least examples= 4600
# This function takes a Pandas DataFrame (df), a maximum number of samples to keep per class (max_samples), a minimum number of samples to keep per class (min_samples), and the column in the DataFrame that contains the class labels (column).
def trim(df, max_samples, min_samples, column):
    # A copy of the input DataFrame is made.
    df=df.copy()
    # The DataFrame is grouped by the column containing the class labels.
    groups=df.groupby(column)
    # A new empty DataFrame is created to store the trimmed data.
    trimmed_df = pd.DataFrame(columns = df.columns)
    # The data is grouped by class, and for each class:
    for label in df[column].unique():
        # Get the group for the current label.
        group=groups.get_group(label)
        # Get the number of samples in the current group.
        count=len(group)
        # If the number of samples in the current group is greater than the maximum number of samples allowed,
        # randomly sample the group down to the maximum number of samples, and add the result to the trimmed DataFrame.
        if count > max_samples:
            sampled_group=group.sample(n=max_samples, random_state=123, axis=0)
            trimmed_df=pd.concat([trimmed_df, sampled_group], axis=0)
        # If the number of samples in the current group is between the minimum and maximum,
        # add the entire group to the trimmed DataFrame.
        elif count >= min_samples:
            trimmed_df=pd.concat([trimmed_df, group], axis=0)
    # Print a message to indicate the maximum and minimum number of samples after trimming.
    print('After trimming, the maximum samples in any class is now ', max_samples, ' and the minimum samples in any class is ', min_samples)
    # Return the trimmed DataFrame.
    return trimmed_df
# Parameter for max samples
max_samples=4000
# Parameter for min samples
min_samples=4000
# Parameter for column to 'resize'
column='emotion'
# Calling function on dataset
data = trim(data, max_samples, min_samples, column)
# Find the unique classes (emotions) in the dataset and count the number of classes.
classes=sorted(list(data['emotion'].unique()))
class_count = len(classes)
# Print the number of classes in the dataset.
print('The number of classes in the dataset is: ', class_count)
print('')
# Group the data by emotion to get the count of each class.
groups=data.groupby('emotion')
print('{0:^30s} {1:^13s}'.format('CLASS', 'VALUE COUNT'))
countlist=[]
classlist=[]
for label in sorted(list(data['emotion'].unique())):
    group=groups.get_group(label)
    countlist.append(len(group))
    classlist.append(label)
    print('{0:^30s} {1:^13s}'.format(label, str(len(group))))
After trimming, the maximum samples in any class is now 4000 and the minimum samples in any class is 4000
The number of classes in the dataset is: 7
CLASS VALUE COUNT
anger 4000
fear 4000
happiness 4000
love 4000
rq 4000
sadness 4000
worry 4000
# Encoding the emotion labels as integer category codes
data['label'] = pd.Categorical(data['emotion']).codes
# Building a decoder that maps each class name to its integer code
# (Categorical codes follow the sorted order of the categories)
test_keys = pd.Categorical(data['emotion']).categories
decoder = {key: code for code, key in enumerate(test_keys)}
data = data[['label', 'text']]
decoder
{'anger': 0,
'fear': 1,
'happiness': 2,
'love': 3,
'rq': 4,
'sadness': 5,
'worry': 6}
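Since the model will eventually output these integer codes, it can be useful to keep the inverse mapping as well. A minimal sketch, using the decoder printed above (the name `code_to_label` is hypothetical):

```python
# The class-name -> code mapping produced above.
decoder = {'anger': 0, 'fear': 1, 'happiness': 2, 'love': 3,
           'rq': 4, 'sadness': 5, 'worry': 6}

# Inverting it gives a code -> class-name lookup for translating
# integer predictions back into emotion labels after inference.
code_to_label = {code: label for label, code in decoder.items()}

print(code_to_label[2])  # happiness
```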
Tokenization is an essential step before any further processing of the text. One way to tokenize text is with the RegexpTokenizer class from the nltk.tokenize module.
Lemmatization is often used to reduce the inflectional forms of words to a common base form for analysis and comparison.
The WordNetLemmatizer class from the nltk.stem module is a popular tool for lemmatization.
Stemming involves removing suffixes and prefixes from words to get their base form. It helps to reduce the number of distinct tokens that need to be processed and analyzed,
which is useful in applications such as information retrieval and text classification.
PorterStemmer is a stemming algorithm that is widely used in NLP. It is an iterative algorithm that applies a set of rules to a word to obtain its stem.
The algorithm works by identifying suffixes and removing them until the stem is obtained. PorterStemmer is rule-based and applies its rules according to the length and structure of the word being stemmed.
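A quick illustration of what the Porter stemmer produces, assuming NLTK is installed. Note that because stemming is purely rule-based suffix stripping, the result is not always a dictionary word:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# 'running' reduces cleanly to 'run', while 'happiness' becomes the
# non-word stem 'happi' after the '-ness' suffix rule fires.
for word in ['running', 'boiling', 'words', 'happiness']:
    print(word, '->', stemmer.stem(word))
```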
# Parameter for tokenizer (might be adjusted later based on pre-trained model requirement)
tokenizer = RegexpTokenizer(r'\w+')
# Parameter for lemmatizer (might be adjusted later based on pre-trained model requirement)
lemmatizer = WordNetLemmatizer()
# Parameter for stemmer (might be adjusted later based on pre-trained model requirement)
stemmer = PorterStemmer()
# This function uses BeautifulSoup to remove any HTML tags from the input text and returns the cleaned text.
def remove_html(text):
    soup = BeautifulSoup(text, 'lxml')
    html_free = soup.get_text()
    return html_free
# This function takes in a list of words as input and removes any stopwords (common words that do not add much meaning to the text) from the list. It returns the list of non-stopwords.
def remove_stopwords(text):
    words = [w for w in text if w not in stopwords.words('english')]
    return words
# This function takes in a list of words and lemmatizes them (i.e., reduces each word to its base form, or lemma) using WordNetLemmatizer. It returns the list of lemmatized words.
def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text
# This function takes in a list of words and stems them (i.e., reduces each word to its root form) using PorterStemmer. It returns a string containing the stemmed words separated by spaces.
def word_stemmer(text):
    stem_text = " ".join([stemmer.stem(i) for i in text])
    return stem_text
# Turning off the DataFrame slice-copy warning message
pd.set_option('mode.chained_assignment', None)
# Converting data to string format
data['text'] = data['text'].apply(lambda x: str(x))
# Calling remove_html function on all input text
data['text'] = data['text'].apply(lambda x: remove_html(x))
# Calling tokenizer function on all input text
data['text'] = data['text'].apply(lambda x: tokenizer.tokenize(x))
# Calling remove_stopwords function on all input text
data['text'] = data['text'].apply(lambda x: remove_stopwords(x))
# Calling word_lemmatizer function on all input text
data['text'] = data['text'].apply(lambda x: word_lemmatizer(x))
# Calling word_stemmer function on all input text
data['text'] = data['text'].apply(lambda x: word_stemmer(x))
data.to_csv('SA_model_data')
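One suggestion: to_csv writes the DataFrame index by default, which is exactly what produced the 'Unnamed: 0' dummy column that had to be dropped after loading the EDA output. Passing index=False, sketched here on a hypothetical toy frame, would avoid recreating it:

```python
import pandas as pd

# Toy frame standing in for the pre-processed data.
df = pd.DataFrame({'label': [0, 1], 'text': ['run game', 'happi day']})

# index=False keeps the row index out of the file, so no 'Unnamed: 0'
# column appears when the CSV is read back in the modelling phase.
df.to_csv('SA_model_data_noindex.csv', index=False)

reloaded = pd.read_csv('SA_model_data_noindex.csv')
print(list(reloaded.columns))  # ['label', 'text']
```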